pennies |>
  ggplot(aes(x = year)) +
  geom_histogram(binwidth = 5, col = 'white', fill = 'steelblue') +
  scale_x_continuous(breaks = seq(1945, 2011, by = 5))
Keep it Going
Does this population seem normally distributed to you? Why or why not?
pennies |>
  ggplot(aes(x = year, y = '')) +
  geom_violin(fill = 'steelblue', alpha = 0.5) +
  geom_boxplot(width = 0.3) +
  stat_summary(fun = 'mean', geom = 'point', shape = 23, size = 5, fill = 'orange') +
  labs(caption = 'Orange diamond indicates the population mean.')
Last One
If the points in your Q-Q plot don’t follow the straight line closely, it may not be reasonable to assume that your population is normally distributed.
pennies |>
  ggplot(aes(sample = year)) +
  geom_qq_line() +
  geom_qq(col = 'steelblue', alpha = 0.5, size = 3) +
  labs(
    title = 'Q-Q Plot for a Normal Distribution in Minting Year',
    subtitle = paste('Q-Q stands for Quantile-Quantile, because it plots',
                     'the theoretical distribution against the observed.',
                     sep = '\n'),
    caption = 'The points will all fall on the straight line when the distribution is perfectly normal.'
  )
Calculating population parameters
You can use functions from the kableExtra package to turn your raw R outputs into attractive tables.
If the assumptions hold, the means \(\bar{x}\) of samples of size \(n=50\) from this population should follow the distribution \(\text{N}(1989.8, 1.76)\), where 1989.8 is the population mean and 1.76 is the standard error \(\sigma/\sqrt{n}\).
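As a rough sketch of where the second parameter of that distribution comes from, the standard error of the mean can be computed as the population SD divided by the square root of the sample size. The values below are the ones reported in the plot captions (Pop. Mean = 1989.8, Pop. SD = 12.4); because the SD is already rounded, the result lands close to, but not exactly on, 1.76.

```r
# Values as reported on the slides (the SD is rounded to one decimal place)
pop_mean <- 1989.8
pop_sd   <- 12.4
n        <- 50

# Standard error of the sample mean: sigma / sqrt(n)
se <- pop_sd / sqrt(n)
round(se, 2)

# Range expected to hold about 95% of sample means under normality
expected_range <- pop_mean + c(-1, 1) * qnorm(0.975) * se
expected_range
```

The last line is only a sanity check on the scale: sample means of 50 pennies should rarely stray more than a few years from the population mean.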
Storing sampling distribution estimates
Use the pull() function from dplyr to grab just the value(s) from a column, ditching the data frame component.
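To illustrate the difference, here is a small sketch comparing select() and pull(). The table `pop_summary` and its column names are hypothetical stand-ins, not objects from these slides, and the example assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for a one-row table of summary statistics
pop_summary <- tibble(mean_year = 1989.8, sd_year = 12.4)

# select() keeps the data frame wrapper around the column
as_table <- pop_summary |> select(mean_year)

# pull() returns the bare value, ready for arithmetic or plotting calls
as_value <- pop_summary |> pull(mean_year)

class(as_table)
class(as_value)
```

A bare value is what you want when the number feeds into something like yintercept in geom_hline(), which expects a plain number instead of a table.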
From their study population of 800 pennies, they took a sample of 50 pennies.
The Sample
The pennies in the sample are labeled with a unique ID number and the year that they were minted.
Taking a sample, pt. 1
When running random processes in R, call set.seed() with a number first if you want reproducible results. This temporarily “fixes” the randomness so that the same code generates the same set of numbers every time.
# store number of observations
n <- nrow(pennies)

# randomly sample row numbers/indexes
# set a seed to reproduce exact sample next time
set.seed(123)
sample_rows <- sample(seq(1, n), size = 50)
sort(sample_rows)
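A quick way to convince yourself that the seed controls the draw is to sample twice with the same seed and once more without resetting it. This sketch uses 800 as the population size, matching the study population, and draws only 5 row numbers to keep the output short; the object names are illustrative.

```r
# Same seed, same draw: the two samples are identical
set.seed(123)
first_draw <- sample(seq(1, 800), size = 5)
set.seed(123)
second_draw <- sample(seq(1, 800), size = 5)
identical(first_draw, second_draw)

# Without resetting the seed, the generator keeps moving and the draw changes
third_draw <- sample(seq(1, 800), size = 5)
identical(first_draw, third_draw)
```

This is why the seed is set immediately before the call to sample(): any random call in between would shift the generator and change the result.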
Use the filter() function from the dplyr package to retain observations whose row number matches your generated sample.
# filter your table for rows whose
# number/index is in your sample list
sample1 <- pennies |>
  # function from dplyr package
  filter(row_number() %in% sample_rows)
nrow(sample1)
[1] 50
Or use base R to index the proper rows. Inside the square brackets, the first index gives the row numbers to keep and the second gives the column numbers to keep. Leave an index blank to keep all of them.
sample1 <- pennies[sample_rows, ]
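The two approaches select the same rows, but they can return them in a different order. The sketch below uses a hypothetical ten-row table `toy` (with an assumed `ID` column) and base R only: a logical match, which mirrors what filter(row_number() %in% ...) does, keeps the table's original order, while indexing returns rows in the order the row numbers were supplied.

```r
# Hypothetical stand-in table with an ID column, like the labeled pennies
toy <- data.frame(ID = 1:10, year = 2001:2010)
rows <- c(7, 2, 9)

# Indexing returns rows in the order the row numbers were given
by_index <- toy[rows, ]
by_index$ID

# A logical match keeps the table's original row order
by_match <- toy[seq_len(nrow(toy)) %in% rows, ]
by_match$ID

# Sorting the row numbers first makes the two results agree
identical(toy[sort(rows), "ID"], by_match$ID)
```

For computing a sample mean the order makes no difference, but it matters if you later compare the two samples row by row.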
What about another sample?
Set a different seed with the set.seed() function to generate a new (but reproducible) random sample.
# randomly sample row numbers/indexes
# set a seed to reproduce exact sample next time
set.seed(456)
sample_rows2 <- sample(seq(1, n), size = 50)

# Select sampled rows through indexing
sample2 <- pennies[sample_rows2, ]
Resampling with Infer
You can use the rep_sample_n() function from the infer package to draw reps repeated samples of a specified size, with (replace = TRUE) or without (replace = FALSE) replacement.
# don't forget to set a seed!
set.seed(789)
samples <- pennies %>%
  rep_sample_n(size = 50, replace = TRUE, reps = 100)
head(samples)
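To see what repeated sampling amounts to, here is a base R sketch of the same idea. It is not how infer implements rep_sample_n(); it only mimics the shape of the result, with one block of rows per sample and a `replicate` label. The population `toy_pop` is a made-up stand-in with 800 years drawn from 1945 to 2011, not the real penny data.

```r
# Hypothetical stand-in population: 800 minting years (not the real penny data)
set.seed(789)
toy_pop <- data.frame(year = sample(1945:2011, size = 800, replace = TRUE))

# Draw `reps` samples of `size` rows and stack them with a replicate label
reps <- 100
size <- 50
toy_samples <- do.call(rbind, lapply(seq_len(reps), function(r) {
  rows <- sample(nrow(toy_pop), size = size, replace = TRUE)
  data.frame(replicate = r, year = toy_pop$year[rows])
}))

# One mean per replicate: the raw material for a sampling distribution
sample_means <- tapply(toy_samples$year, toy_samples$replicate, mean)
length(sample_means)
```

Each of the 100 means is one draw from the sampling distribution, which is what the confidence interval plots summarize.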
p1 <- cis |>
  # from the dplyr package, for modifying data
  mutate(contains_mu = ifelse(
    pop_mean >= lower95 & # conditional statement
      pop_mean <= upper95,
    T, F)) |> # value to return if true, if false
  ggplot(aes(x = replicate)) +
  geom_ribbon(aes(ymin = pop_low, ymax = pop_high), alpha = 0.1) +
  geom_hline(yintercept = pop_mean, linetype = 'dashed') +
  coord_cartesian(ylim = c(min(cis$lower95), max(cis$upper95)))
p1
Visualizing the CI’s
p2 <- p1 +
  geom_pointrange(aes(y = mean, ymin = lower95, ymax = upper95,
                      col = contains_mu, fill = contains_mu))
p2
Visualizing the CI’s
p3 <- p2 +
  theme(legend.position = 'bottom') +
  labs(
    title = '95% Confidence Intervals for the Mean Year of Penny Minting',
    subtitle = 'Inferences from Samples vs Population Parameters',
    caption = 'Pop. Mean = 1989.8, Pop. SD = 12.4',
    x = 'Replicate',
    y = 'Year'
  )
p3
All Together
Reordered
cis |>
  mutate(contains_mu = ifelse(
    pop_mean >= lower95 & pop_mean <= upper95, T, F)) |>
  ggplot(aes(x = reorder(replicate, mean))) +
  geom_ribbon(aes(ymin = pop_low, ymax = pop_high), alpha = 0.1) +
  geom_hline(yintercept = pop_mean, linetype = 'dashed') +
  coord_cartesian(ylim = c(min(cis$lower95), max(cis$upper95))) +
  geom_pointrange(aes(y = mean, ymin = lower95, ymax = upper95,
                      col = contains_mu, fill = contains_mu)) +
  theme(legend.position = 'bottom', axis.text.x = element_blank()) +
  labs(
    title = '95% Confidence Intervals for the Mean Year of Penny Minting',
    subtitle = 'Inferences from Samples vs Population Parameters',
    caption = 'Pop. Mean = 1989.8, Pop. SD = 12.4',
    x = 'Replicate',
    y = 'Year'
  )
Misinterpreting Confidence Intervals
Taking a sample and constructing a confidence interval are considered random processes
Confidence is an expression of probability regarding the PROCESS, not the result
The confidence level is a pre-analysis statement about the process, not a post-analysis statement about the result
95% confidence means that 95% of the time, confidence intervals constructed using samples of size n from this population will contain the population parameter…
…meaning there’s a 5% chance (alpha) that your confidence interval does NOT contain the population parameter no matter how good your data is!
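The "95% of the time" claim is about the long run, and a small simulation can make that concrete. This sketch assumes a normal population with the mean and SD reported on the slides and builds t-based intervals with t.test(); the slides' own `cis` table may have been constructed differently, and the seed and number of replicates are arbitrary choices.

```r
# Hypothetical normal population using the slide's reported mean and SD
set.seed(2024)
mu <- 1989.8
sigma <- 12.4
n <- 50
reps <- 2000

# For each replicate: sample, build a 95% t-interval, check whether it holds mu
covers <- replicate(reps, {
  x <- rnorm(n, mean = mu, sd = sigma)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] <= mu && mu <= ci[2]
})

# Long-run share of intervals that contain mu (expected to be near 0.95)
mean(covers)
```

Each individual entry of `covers` is simply TRUE or FALSE. The 95% describes the share across many repetitions of the process, not any single interval.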
Misinterpreting Confidence Intervals
A “true” population parameter is unknowable. You should not be using this terminology to interpret your results.
If you have made reasonable assumptions, theory suggests (and one hopes) the point estimate or sample statistic will be close to that “true” population parameter we can’t know.
However, if any assumptions are violated, the sample statistics may no longer be accurate estimators of the population parameters.
The confidence level is not the probability that YOUR confidence interval contains the population parameter.